Data Gathering

To see all raw data gathered click here

Quarterly and Annual Ridership Totals by Mode​ of Transportation 1

The initial piece of data that was gathered comes from the American Public Transportation Association, and can serve as an introductory synopsis of the state of public transit ridership over time. This gives a broad view of quarterly ridership across the entire country from 1990 onward. Thus, this data has been chosen for the potential of setting the stage for the problem which we intend to explore.

The raw data and methodology for how it was obtained can be found using this link: https://www.apta.com/research-technical-resources/transit-statistics/ridership-report/

The data itself can be downloaded using this link: https://www.apta.com/wp-content/uploads/APTA-Ridership-by-Mode-and-Quarter-1990-Present.xlsx

To download this data, I used an R API tool, which saves the data in Excel format. Below is the code for this action and a screenshot of the raw data to illustrate its form upon download:

Code
library(readxl)
library(httr)
url1<-'https://www.apta.com/wp-content/uploads/APTA-Ridership-by-Mode-and-Quarter-1990-Present.xlsx'
GET(url1, write_disk(tf <- tempfile(pattern = "APTA-Ridership-by-Mode-and-Quarter-1990-Present", fileext = ".xlsx", tmpdir = "../data")))
df <- read_excel(tf, 2L)
str(df)

Quarterly and Annual Ridership Totals by Mode​ of Transportation

Census Data for Commute to Work 2

Note: Due to the size of this data, it will not be hosted on Github. The raw data can be accessed using this link.

In answering the question of whether or not public transit’s public service should be the paramount consideration for its efficacy, it is important to understand that it often provides service disproportionally to underprivileged groups. Therefore, a source of data that will be useful is a survey dataset from IPUMS, which has millions of survey responses from the U.S. Census Bureau. We will be using the survey data gathered from 2021, the most recent sample available, and one with a significantly large volume of responses. The main reason for obtaining this data is the presence of a “Means of transportation to work” field, which will serve as our labels for supervised learning. This can tell us the commute methods for a large sample of respondents, whom we can analyze by looking at demographic and lifestyle data to gather insights on what factors into one’s means of transportation to work.

This data was gathered via the IPUMS website by selecting a sample (2021) and specifying fields that will be required for further analysis. IPUMS provides numeric codes for categorical variables and to represent NA values, the meanings of which are specified on their website. This will be useful in cleaning the data. The fields for this data extract are below:

Survey Extract Fields

After extracting, the data appears as the following:

Survey Raw Dataset

Public Transit Data by City3

Just as not every city was equally impacted by the COVID-19 pandemic, the performance of public transit differs drastically depending on where one goes in the US. Our goal here is to assess what factors impact public transit ridership and cost-effectiveness for a city to determine any action items that can be taken to improve such metrics. Every quarter, the American Public Transit Association (APTA) conducts a Ridership Report which includes key performance indicators of public transit systems across the US. Our analysis will use data from their Third Quarter 2023 Ridership Report, which is the most recent one available at the time of writing this.

The observational unit for this data is not city, but rather type of transportation. Therefore, any given city may have several rows, possibly split into Agency, as multiple organizations may serve a single city to provide service to its residents. These agencies are also split by row, corresponding to the mode of transportation. Another notable aspect of this data is that it includes general information about each city, such as population and area. In this case, records have the same value in these columns for each city, so when cleaning, we can consolidate records by city and not worry about conflating differing values. This will allow us to not only focus on features of public transit systems in a vacuum, but also observe characteristics of cities that may lead to various public transit phenomena. A screenshot of the raw data gathered from APTA is below:

Public Transit Data by City

Yelp Reviews

Gauging public sentiment regarding public transit systems can be a great way to analyze the relationship between said system and the residents of its respective city. Regardless of external factors, consumer dissatisfaction of a mode of transportation could greatly influence its usage when other methods are readily available for many. Thus, these datasets will feature Yelp reviews of the top seven most used public transit systems in the US, as measured by the data gathered from the American Public Transit Association. These public transit systems are:

  • Metropolitan Transit Authority (New York City)4
  • Los Angeles County Metropolitan Transportation Authority5
  • Chicago Transit Authority6
  • Southeastern Pennsylvania Transportation Authority (Philadelphia)7
  • Massachusetts Bay Transportation Authority (Boston)8
  • Bay Area Rapid Transit (San Francisco Bay Area)9
  • Washington Metropolitan Area Transit Authority (Washington, D.C.)10

This data will include the date of review, the exact text, and the associated numerical rating (1-5 stars). Gathering labeled text data will be invaluable for Naive Bayes classification in the future. To accomplish this, I will use the BeautifulSoup package in Python, which facilitates web scraping via HTML codes. Since the reviews span several pages, it is necessary to iterate over each page to obtain every review available to us. Therefore, I adapted a generalized function for both reading in the data and storing it as a pandas dataframe.11 The code for that function is below:

Code
import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
def get_yelp(url, pages):   
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    hold1 = soup.find(string='Recommended Reviews')
    if hold1 is not None:
        reviews = hold1.find_parent('section')
    num_rating1 = []
    review_date1 = []
    review_text1 = []
    for review in reviews.select('div[aria-label$="star rating"]'):
        num_rating1.append(review['aria-label'])
        review_date1.append(review.find_next('span').text)
        hold = review.find_next('span', lang=True)
        if hold is None:
            review_text1.append("NA")
        else:
            review_text1.append(hold.text)

    for i in range(1,pages):
        r = requests.get(url + '?start=' + str(i) + '0')
        soup = BeautifulSoup(r.text, 'html.parser')
        hold1 = soup.find(string='Recommended Reviews')
        if hold1 is not None:
            reviews = hold1.find_parent('section')
        for review in reviews.select('div[aria-label$="star rating"]'):
            num_rating1.append(review['aria-label'])
            review_date1.append(review.find_next('span').text)
            hold = review.find_next('span', lang=True)
            if hold is None:
                review_text1.append("NA")
            else:
                review_text1.append(hold.text)
    reviews = pd.DataFrame(list(zip(num_rating1,review_date1,review_text1)))
    return(reviews)

Upon creating and running the function above, we can now call it using our seven most used public transit systems in the US. The inputs for this function are the following:

  • The URL for the first page on Yelp
  • The number of pages of reviews

The code for calling this function is below, with an example of the output attached:

Code
get_yelp('https://www.yelp.com/biz/metropolitan-transportation-authority-new-york-6',14).to_csv('../data/yelp_reviews/mta_reviews.csv')
get_yelp('https://www.yelp.com/biz/metro-los-angeles-los-angeles',18).to_csv('../data/yelp_reviews/la_reviews.csv')
get_yelp('https://www.yelp.com/biz/chicago-transit-authority-chicago-6',38).to_csv('../data/yelp_reviews/cta_reviews.csv')
get_yelp('https://www.yelp.com/biz/septa-philadelphia-7',10).to_csv('../data/yelp_reviews/septa_reviews.csv')
get_yelp('https://www.yelp.com/biz/massachusetts-bay-transportation-authority-boston',34).to_csv('../data/yelp_reviews/mbta_reviews.csv')
get_yelp('https://www.yelp.com/biz/bart-bay-area-rapid-transit-oakland-2',101).to_csv('../data/yelp_reviews/bart_reviews.csv')
get_yelp('https://www.yelp.com/biz/wmata-washington',10).to_csv('../data/yelp_reviews/wmata_reviews.csv')

WMATA Yelp Reviews

Footnotes

  1. “Ridership Report.” American Public Transportation Association, 21 Sept. 2023, www.apta.com/research-technical-resources/transit-statistics/ridership-report/.↩︎

  2. Steven Ruggles, Sarah Flood, Matthew Sobek, Danika Brockman, Grace Cooper, Stephanie Richards, and Megan Schouweiler. IPUMS USA: Version 13.0 [dataset]. Minneapolis, MN: IPUMS, 2023. https://doi.org/10.18128/D010.V13.0↩︎

  3. “Raw monthly ridership (no adjustments or estimates),” Raw Monthly Ridership (No Adjustments or Estimates) | FTA, https://www.transit.dot.gov/ntd/data-product/monthly-module-raw-data-release (accessed Nov. 14, 2023).↩︎

  4. “Metropolitan Transportation Authority - New York, NY,” Yelp, https://www.yelp.com/biz/metropolitan-transportation-authority-new-york-6 (accessed Nov. 14, 2023).↩︎

  5. “Metro Los Angeles - Los Angeles, CA,” Yelp, https://www.yelp.com/biz/wmata-washington (accessed Nov. 14, 2023).↩︎

  6. “Chicago Transit Authority - Chicago, IL,” Yelp, https://www.yelp.com/biz/metro-los-angeles-los-angeles (accessed Nov. 14, 2023).↩︎

  7. “Septa - Philadelphia, PA,” Yelp, https://www.yelp.com/biz/septa-philadelphia-7 (accessed Nov. 14, 2023).↩︎

  8. “Massachusetts Bay Transportation Authority - Boston, MA,” Yelp, https://www.yelp.com/biz/massachusetts-bay-transportation-authority-boston (accessed Nov. 14, 2023).↩︎

  9. “WMATA - Washington, DC, DC,” Yelp, https://www.yelp.com/biz/wmata-washington (accessed Nov. 2, 2023).↩︎

  10. “Bart - Bay Area Rapid Transit - Oakland, CA,” Yelp, https://www.yelp.com/biz/bart-bay-area-rapid-transit-oakland-2 (accessed Nov. 2, 2023).↩︎

  11. “Is it actually possible to scrape reviews from yelp with Beautifulsoup?,” Reddit, https://www.reddit.com/r/learnpython/comments/d5a71p/comment/f0mi983/?utm_source=share&utm_medium=web2x&context=3 (accessed Dec. 5, 2023).↩︎

  12. Barrero, Jose Maria, et al. Why Working from Home Will Stick, 2021, https://doi.org/10.3386/w28731.↩︎